Isotropic Dynamic Hierarchical Clustering
Authors
Abstract
We face a business need of discovering a pattern in the locations of a great number of points in a high-dimensional space. We assume that there is a certain structure: in some locations the points are close together, while in other locations they are more dispersed. Our goal is to group the close points together. The process of grouping close objects is known as clustering. Our task has the following distinctive features:

1. We are particularly interested in a hierarchical structure. A flat structure may reduce the number of objects, but the data remain difficult to manage or present.

2. The classical technique suited for the task at hand is a B-Tree. The key properties of the B-Tree are that it is hierarchical and balanced, and that it can be constructed dynamically from the input data. In these respects, the B-Tree has certain advantages over other clustering algorithms, where the number of clusters must be defined a priori. The B-Tree approach gives hope that the structure of the input data will be determined well without any supervised learning.

3. The space is Euclidean and isotropic. This is the most challenging part of the project, because currently there are no B-Tree implementations that process indices in a symmetric and isotropic way. Some known implementations are based on constructing compound asymmetric indices from point coordinates, where the main index works as a key while the function of the other (999!) indices is lost; other known implementations split the nodes along coordinate hyperplanes, sacrificing the isotropy of the original space. In the latter case the clusters become coordinate-parallel parallelepipeds, which is a rather artificial and unnecessary assumption. Our implementation of a B-Tree for a high-dimensional space is based directly on concepts of factor analysis.

4. We need to process a great deal of data: something like tens of millions of points in a thousand-dimensional space. The application has to be scalable, even though, technically, our task is not considered a true Big Data problem. We use dispersed data structures and optimized algorithms. Ideally, a cluster would be an ellipsoid in a high-dimensional space, but such an implementation would require storing O(n) ellipsoid axes, which is impractical. Instead, we use multi-dimensional balls defined by their centers and radii. On the other hand, calculation of statistical values such as the mean and the average deviation can be done incrementally. This means that when a point is added to the tree, the statistical values for the nodes at all levels can be recalculated in O(1) time. The node statistical values are used to split overloaded nodes in an optimal way. We support both brute-force O(2^n) and greedy O(n) split algorithms. Statistical and aggregated node information also allows aggregated sets of closely located points to be manipulated (searched, deleted).

5. Hierarchical information retrieval. When searching, the user is presented with the highest appropriate nodes in the tree hierarchy, with the most important clusters emerging in the hierarchy automatically. Then, if interested, the user may navigate down the tree to more specific points.

The system is implemented as a library of Java classes representing Points in a multi-dimensional space, Sets of points with aggregated statistical information (mean, standard deviation), a B-Tree, and Nodes, with support for serialization and storage in a MySQL database.
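The O(1)-per-node incremental recalculation of statistics described in point 4 can be sketched with an online (Welford-style) update of count, mean, and squared deviations. This is a minimal illustration, not the paper's actual API; the class and method names here are hypothetical.

```java
// Illustrative sketch of incremental per-node statistics: adding a point
// updates count, mean, and variance accumulators without revisiting earlier
// points, so every ancestor node of a leaf can be refreshed cheaply.
public class NodeStats {
    private final int dim;
    private long count = 0;
    private final double[] mean;
    private final double[] m2; // per-coordinate sum of squared deviations

    public NodeStats(int dim) {
        this.dim = dim;
        this.mean = new double[dim];
        this.m2 = new double[dim];
    }

    /** Incorporate one point; the work is independent of how many points were seen. */
    public void add(double[] point) {
        count++;
        for (int i = 0; i < dim; i++) {
            double delta = point[i] - mean[i];
            mean[i] += delta / count;
            m2[i] += delta * (point[i] - mean[i]); // Welford's online update
        }
    }

    public double[] mean() { return mean.clone(); }

    /** Per-coordinate sample standard deviation. */
    public double stdDev(int i) {
        return count > 1 ? Math.sqrt(m2[i] / (count - 1)) : 0.0;
    }

    public static void main(String[] args) {
        NodeStats s = new NodeStats(2);
        s.add(new double[]{0.0, 2.0});
        s.add(new double[]{4.0, 2.0});
        System.out.println(s.mean()[0]); // 2.0
    }
}
```

In a tree, each node on the root-to-leaf path would hold such an accumulator, so an insertion touches only O(depth) accumulators, each in constant time per dimension.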
Similar resources
Isotropic Clustering for Hierarchical Radiosity: Implementation and Experiences
Although Hierarchical Radiosity was a big step forward for finite element computations in the context of global illumination, the algorithm can hardly cope with scenes of more than medium complexity. The reason is that Hierarchical Radiosity requires an initial linking step, comparing all pairs of initial objects in the scene. These initial objects are then hierarchically subdivided in order to a...
A Novel Hybrid Clustering Method Using the Artificial Immune System and Hierarchical Clustering
Artificial immune system (AIS) is one of the meta-heuristic algorithms used to solve complex problems. With large amounts of data, making rapid decisions and producing stable results are the most challenging tasks, due to the rapid variation in the real world. Clustering is a possible solution for overcoming these problems. The goal of clustering analysis is to group similar objects. AIS algor...
Dynamic Hierarchical Compact Clustering Algorithm
In this paper we introduce a general framework for hierarchical clustering that deals with both static and dynamic data sets. From this framework, different hierarchical agglomerative algorithms can be obtained, by specifying an inter-cluster similarity measure, a subgraph of the β-similarity graph, and a cover algorithm. A new clustering algorithm called Hierarchical Compact Algorithm and its ...
Graph Clustering by Hierarchical Singular Value Decomposition with Selectable Range for Number of Clusters Members
Graphs have many applications in real-world problems. When we deal with huge volumes of data, analyzing the data is difficult or sometimes impossible. In big data problems, clustering is a useful tool for data analysis. Singular value decomposition (SVD) is one of the best algorithms for clustering graphs, but we do not have any choice to select the number of clusters and the number of members ...
The New Software Package for Dynamic Hierarchical Clustering for Circles Types of Shapes
In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases. ...
Improving the Dynamic Hierarchical Compact Clustering Algorithm by Using Feature Selection
Feature selection has improved the performance of text clustering. In this paper, a local feature selection technique is incorporated in the dynamic hierarchical compact clustering algorithm to speed up the computation of similarities. We also present a quality measure to evaluate hierarchical clustering that considers the cost of finding the optimal cluster from the root. The experimental resu...
Journal: CoRR
Volume: abs/1605.07030, Issue: -
Pages: -
Publication year: 2016